where $a(\cdot)$ denotes an activation function (e.g., tanh), and the superscripts on $f$ and $w_0,\ldots,w_N$ indicate that they represent a single-layer unit and its internal weights, respectively.
1: Choose an activation function $a\left(\cdot\right)$
2: Compute the linear combination: $\,\,\,\,\,\,\,\,\, v = w_{0}^{(1)}+{\sum_{n=1}^{N}}{w_{n}^{(1)}\,x_n}$
3: Pass result through activation: $\,\,\, a\left(v\right)$
4: output: Single layer unit $\,\, a\left(v\right)$
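The recipe above can be sketched in a few lines of NumPy. The function name `single_layer_unit` and the particular weight values are illustrative choices (not from the text), and tanh is assumed as the activation:

```python
import numpy as np

def single_layer_unit(x, w, a=np.tanh):
    # w[0] is the bias w_0; w[1:] are the weights w_1, ..., w_N
    v = w[0] + np.dot(w[1:], x)   # step 2: linear combination
    return a(v)                   # step 3: pass result through activation

# illustrative weights and an N = 2 input
x = np.array([0.5, -1.0])
w = np.array([0.1, 2.0, -0.5])    # [w_0, w_1, w_2]
out = single_layer_unit(x, w)     # equals tanh(0.1 + 2.0*0.5 + (-0.5)*(-1.0)) = tanh(1.6)
```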
Here we use tanh as the nonlinear activation function, where in each instance the internal parameters of the unit (i.e., $w_0^{(1)}$ and $w_1^{(1)}$) have been randomly set, giving each instance a distinct shape.
demo.show_1d_net(num_layers = 1, activation = 'tanh')
Here we use ReLU as the nonlinear activation function (in place of the tanh activation).
demo.show_1d_net(num_layers = 1, activation = 'relu')
whose $j^{th}$ unit takes the form:
\begin{equation*} f^{(1)}_j\left(\mathbf{x}\right)=a\left(w^{\left(1\right)}_{0,\,j}+\underset{n=1}{\overset{N}{\sum}}{w^{\left(1\right)}_{n,\,j}\,x_n}\right) \end{equation*}
We denote
\begin{equation*} \mathring{\mathbf{x}}=\left[\begin{array}{c} 1\\ x_{1}\\ \vdots\\ x_{N} \end{array}\right] \end{equation*}
We collect all internal parameters of our $U_1$ single-layer units and place them into an $(N+1)\times U_1$ matrix $\mathbf{W}_1$
1: Choose an activation function $a\left(\cdot\right)$, number of single layer units $U_1$
2: Construct $U_1$ single-layer units: $\,\,\,\,\,\,\,\,\,f^{(1)}_i\left(\mathbf{x}\right)$ for $i=1,\,...,U_1$
3: Compute the linear combination: $\,\,\,\,\,\,\, v = w_{0}^{(2)}+{\sum_{i=1}^{U_1}}{w_{i}^{(2)}\,f^{(1)}_i}\left(\mathbf{x}\right)$
4: Pass the result through activation: $\,\,\,\, a\left(v\right)$
5: output: Two layer unit $\,\, a\left(v\right)$
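The two-layer recipe can likewise be sketched by feeding $U_1$ single-layer units into one more linear combination and activation. The helper names, the tanh activation, and the random dimensions below are illustrative assumptions:

```python
import numpy as np

def single_layer_unit(x, w, a=np.tanh):
    # a single-layer unit: activation of a linear combination of the input
    return a(w[0] + np.dot(w[1:], x))

def two_layer_unit(x, W1, w2, a=np.tanh):
    # W1 is (N+1) x U_1: column j holds the internal weights of unit f_j^(1)
    f1 = np.array([single_layer_unit(x, W1[:, j], a) for j in range(W1.shape[1])])
    # linear combination of the U_1 single-layer units, then activation
    v = w2[0] + np.dot(w2[1:], f1)
    return a(v)

np.random.seed(0)
x = np.random.randn(3)       # N = 3 input
W1 = np.random.randn(4, 5)   # U_1 = 5 single-layer units
w2 = np.random.randn(6)      # [w_0^(2), w_1^(2), ..., w_5^(2)]
out = two_layer_unit(x, W1, w2)
```

Since tanh is the outer activation here, the unit's output always lies in $(-1, 1)$.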
tanh:
\begin{equation*}
f^{(2)}(x) = \text{tanh}\left(w_0^{(2)} + w_1^{(2)}\,f^{(1)}(x)\right)
\end{equation*}
where
\begin{equation*}
f^{(1)}(x) = \text{tanh}\left(w_0^{(1)} + w_1^{(1)}x\right)
\end{equation*}
demo.show_1d_net(num_layers = 2, activation = 'tanh')
ReLU:
\begin{equation*}
f^{(2)}(x) = \text{max}\left(0,w_0^{(2)} + w_1^{(2)}\,f^{(1)}(x)\right)
\end{equation*}
where
\begin{equation*}
f^{(1)}(x) = \text{max}\left(0, w_0^{(1)} + w_1^{(1)}x\right)
\end{equation*}
demo.show_1d_net(num_layers = 2, activation = 'relu')
where the $j^{th}$ two-layer unit looks like:
\begin{equation*} f^{\left(2\right)}_j\left(\mathbf{x}\right)=a\left(w^{\left(2\right)}_{0,\,j}+\underset{i=1}{\overset{U_1}{\sum}}{w^{\left(2\right)}_{i,\,j}}\,f^{(1)}_i\left(\mathbf{x}\right)\right) \end{equation*}
We can collect the internal weights of these $U_2$ two-layer units into a $\left(U_1+1\right) \times U_2$ matrix $\mathbf{W}_2$, which mirrors precisely how we defined the $\left(N+1\right) \times U_1$ internal weight matrix $\mathbf{W}_1$ for our single-layer units.
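Since column $j$ of a layer's weight matrix holds the internal weights of its $j^{th}$ unit, an entire layer can be computed with a single matrix product against the extended input $\mathring{\mathbf{x}}$. A minimal sketch, where the function name `layer` and the dimensions are illustrative:

```python
import numpy as np

def layer(x, W, a=np.tanh):
    # prepend 1 to form the extended input x-ring, so W.T @ x_ring computes
    # the bias-plus-linear-combination of every unit in the layer at once
    x_ring = np.concatenate(([1.0], x))
    return a(np.dot(W.T, x_ring))

np.random.seed(1)
x = np.random.randn(2)       # N = 2 input
W1 = np.random.randn(3, 4)   # (N+1) x U_1 with U_1 = 4
W2 = np.random.randn(5, 3)   # (U_1+1) x U_2 with U_2 = 3
f2 = layer(layer(x, W1), W2) # all U_2 two-layer units computed at once
```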
We can construct similarly general fully connected neural network units with an arbitrary number of hidden layers.
With each hidden layer added, we increase the capacity of a neural network unit.
1: Choose an activation function $a\left(\cdot\right)$, number of $(L-1)$-layer units $U_{L-1}$
2: Construct $(L-1)$-layer units: $\,\,\,\,\,\,\,\,\,f^{(L-1)}_i\left(\mathbf{x}\right)$ for $i=1,\,...,U_{L-1}$
3: Compute the linear combination: $\,\,\,\,\,\,\, v = w_{0}^{(L)}+{\sum_{i=1}^{U_{L-1}}}{w_{i}^{(L)}\,f^{(L-1)}_i}\left(\mathbf{x}\right)$
4: Pass the result through activation: $\,\,\,\, a\left(v\right)$
5: output: $L$-layer unit $\,\, a\left(v\right)$
Here we use the tanh activation. Compared to single- and two-layer neural network units, three-layer units have increased capacity.
demo = DrawBases.Visualizer()
demo.show_1d_net(num_layers = 3, activation = 'tanh')
Here we use the ReLU activation. Compared to single- and two-layer neural network units, three-layer units have increased capacity.
demo = DrawBases.Visualizer()
demo.show_1d_net(num_layers = 3, activation = 'relu')
In general we can produce a model consisting of $B=U_L$ such $L$-layer units as:
\begin{equation*} \text{model}\left(\mathbf{x},\Theta\right) = w_0 + f^{(L)}_1\left(\mathbf{x}\right)w_1 + \cdots + f^{(L)}_{U_L}\left(\mathbf{x}\right)w_{U_L} \end{equation*}
where
\begin{equation*} f^{\left(L\right)}_j\left(\mathbf{x}\right)=a\left(w^{\left(L\right)}_{0,\,j}+\underset{i=1}{\overset{U_{L-1}}{\sum}}{w^{\left(L\right)}_{i,\,j}}\,f^{(L-1)}_i\left(\mathbf{x}\right)\right) \end{equation*}
and where the parameter set $\Theta$ contains both the weights internal to the neural network units as well as the final linear combination weights.
How do we choose the ''right'' number of units and layers for a neural network architecture?
# a feature_transforms function for computing
# U_L L-layer perceptron units
def feature_transforms(a, w):
    # loop through each layer matrix
    for W in w:
        # compute inner product with current layer weights
        a = W[0] + np.dot(a.T, W[1:])
        # pass through activation
        a = activation(a).T
    return a
# an implementation of our model employing a nonlinear feature transformation
def model(x, w):
    # feature transformation
    f = feature_transforms(x, w[0])
    # compute linear combination and return
    a = w[1][0] + np.dot(f.T, w[1][1:])
    return a.T
# create initial weights for arbitrary feedforward network
def initialize_network_weights(layer_sizes, scale):
    # container for entire weight tensor
    weights = []
    # loop over desired layer sizes and create appropriately sized initial
    # weight matrix for each layer
    for k in range(len(layer_sizes) - 1):
        # get layer sizes for current weight matrix
        U_k = layer_sizes[k]
        U_k_plus_1 = layer_sizes[k + 1]
        # make weight matrix
        weight = scale * np.random.randn(U_k + 1, U_k_plus_1)
        weights.append(weight)
    # re-express weights so that w_init[0] contains all internal weight
    # matrices, and w_init[1] contains the weights of the final linear
    # combination in the model function
    w_init = [weights[:-1], weights[-1]]
    return w_init
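As a quick check of the shapes this routine produces, here is a condensed copy of `initialize_network_weights` (reproduced so the snippet runs standalone) applied to a small three-hidden-layer architecture:

```python
import numpy as np

# condensed copy of initialize_network_weights from above,
# so this snippet runs standalone
def initialize_network_weights(layer_sizes, scale):
    weights = [scale * np.random.randn(layer_sizes[k] + 1, layer_sizes[k + 1])
               for k in range(len(layer_sizes) - 1)]
    # internal weight matrices first, final linear-combination weights last
    return [weights[:-1], weights[-1]]

# N = 2 input, three hidden layers of 10 units, M = 1 output
w_init = initialize_network_weights([2, 10, 10, 10, 1], scale=0.5)
internal_shapes = [W.shape for W in w_init[0]]
final_shape = w_init[1].shape
print(internal_shapes)   # [(3, 10), (11, 10), (11, 10)]
print(final_shape)       # (11, 1)
```

Each matrix has one extra row for the bias, matching the $(N+1)\times U_1$ and $(U_{L-1}+1)\times U_L$ dimensions described above.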
data_path_1 = './2_eggs.csv'
# create instance of the nonlinear classification visualizer, used below and in the next examples
demo5 = nonlinear_classification_visualizer.Visualizer(data_path_1)
x = demo5.x.T
y = demo5.y[np.newaxis,:]
demo5.plot_data();
# An example 3 hidden layer network, with 10 units in each layer
N = 2 # dimension of input
M = 1 # dimension of output
U_1 = 10; U_2 = 10; U_3 = 10; # number of units per hidden layer
# the list defines our network architecture
layer_sizes = [N, U_1,U_2,U_3,M]
# generate initial weights for our network
w = initialize_network_weights(layer_sizes, scale = 0.5)
# initialize with input/output data
mylib5 = super_setup.Setup(x,y)
# perform preprocessing step(s) - especially input normalization
mylib5.preprocessing_steps(normalizer = 'standard')
# split into training and validation sets
mylib5.make_train_val_split(train_portion = 1)
# choose cost
mylib5.choose_cost(name = 'softmax')
# choose dimensions of fully connected multilayer perceptron layers
layer_sizes = [10,10,10]
mylib5.choose_features(feature_name = 'multilayer_perceptron',layer_sizes = layer_sizes,activation = 'tanh',scale = 0.5)
# fit an optimization
mylib5.fit(max_its = 1000,alpha_choice = 10**(-1),verbose = False)
mylib5.show_histories()
# pluck out best weights - those that provided the highest training accuracy
ind = np.argmax(mylib5.train_accuracy_histories[0])
w_best = mylib5.weight_histories[0][ind]
demo5.static_N2_simple(w_best,mylib5,view = [30,155])
data_path_2 = './3_layercake_data.csv'
# create an instance of a multiclass classification visualizer
demo3 = nonlinear_classification_visualizer.Visualizer(data_path_2)
x = demo3.x.T
y = demo3.y[np.newaxis,:]
demo3.plot_data();
# define the number of units to use in each layer
N = 2 # dimension of input
U_1 = 12 # number of single layer units to employ
U_2 = 5 # number of two layer units to employ
# initialize internal weights of units in hidden layers
W_1 = 0.1*np.random.randn(N+1,U_1)
W_2 = 0.1*np.random.randn(U_1+1,U_2)
# initialize weights of our linear combination
w_3 = 0.1*np.random.randn(U_2+1,3)
# package all weights together in a single list
w = [W_1,W_2,w_3]
# initialize with input/output data
mylib3 = super_setup.Setup(x,y)
# perform preprocessing step(s) - especially input normalization
mylib3.preprocessing_steps(normalizer = 'standard')
# split into training and validation sets
mylib3.make_train_val_split(train_portion = 1)
# choose cost
mylib3.choose_cost(name = 'multiclass_softmax')
layer_sizes = [12,5]
mylib3.choose_features(feature_name = 'multilayer_perceptron',layer_sizes = layer_sizes,activation = 'tanh',scale = 0.1)
# fit an optimization
mylib3.fit(max_its = 5000,alpha_choice = 10**(-1),verbose = False)
mylib3.show_histories()
# pluck out best weights - those that provided the highest training accuracy
ind = np.argmax(mylib3.train_accuracy_histories[0])
w_best = mylib3.weight_histories[0][ind]
demo3.multiclass_plot(mylib3,w_best)
Networks employing tanh activations typically perform better than the same network employing logistic sigmoid activations, because the tanh function centers its output about the origin. However, because tanh likewise maps input values far from the origin to outputs where its derivative is very close to zero, neural networks employing the tanh activation can also suffer from the vanishing gradient problem. The ReLU has quickly become the most popular activation function in use today: since the ReLU function only maps negative input values to zero, networks employing this activation function tend not to suffer from the vanishing gradient problem. The weights of units employing ReLU activations should, however, be initialized away from the origin to prevent too many of the units (and their gradients) from disappearing.
Let's consider the logistic regression model: \begin{equation*} \sigma\left(w_0 + w_1x \right). \end{equation*}
This can be viewed through the lens of a single-hidden-layer neural network with scalar input and logistic sigmoid activation: \begin{equation*} \text{model}\left(x,\Theta\right) = w_0^{(2)} + w_1^{(2)}\sigma\left(w_0^{(1)} + w_1^{(1)}x \right) \end{equation*} where $w_0^{(2)}=0$ and $w_1^{(2)}=1$.
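We can verify this equivalence numerically; the weight values below are illustrative:

```python
import numpy as np

def sigma(t):
    # logistic sigmoid
    return 1.0 / (1.0 + np.exp(-t))

# illustrative logistic regression weights and a scalar input
w0, w1 = -0.3, 1.7
x = 0.8

# the logistic regression model
logistic = sigma(w0 + w1 * x)

# the same quantity, written as the two-layer model
# with w_0^(2) = 0 and w_1^(2) = 1
w0_2, w1_2 = 0.0, 1.0
model_value = w0_2 + w1_2 * sigma(w0 + w1 * x)
```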
demo = LS_sigmoid.Visualizer(data)
demo.plot_costs(viewmax = 25, view = [21,121])
Here we use the tanh activation.
mylib0.show_histories(labels = labels)
Normalizing the output of the units $f_1^{(L)},\ldots, f_{U_L}^{(L)}$ of an $L$-layer model similarly makes tuning the parameters of such a model easier.
wherein the $n^{th}$ dimension of the input $x_n$ only touches the internal weight $w^{\left(1\right)}_{n,\,j}$.
where
\begin{equation*} \mu_{f_j^{(1)}} = \frac{1}{P}\sum_{p=1}^{P}f_j^{(1)}\left(\mathbf{x}_p \right) \qquad \text{and} \qquad \sigma_{f_j^{(1)}} = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(f_j^{(1)}\left(\mathbf{x}_p \right) - \mu_{f_j^{(1)}} \right)^2}. \end{equation*}
1: input: Activation function $a\left(\cdot\right)$ and input data $\left\{\mathbf{x}_p\right\}_{p=1}^P$
2: Compute linear combination: $\,\,\,\,\,\,\,\,\, v = w_{0}^{(1)}+{\sum_{n=1}^{N}}{w_{n}^{(1)}\,x_n}$
3: Pass result through activation: $\,\,\,\,\,\,\,\,\, f^{(1)}\left(\mathbf{x}\right) = a\left(v\right)$
4: Compute mean $\mu_{f^{(1)}}$ and standard deviation $\sigma_{f^{(1)}}$ of $\left\{f^{(1)}\left(\mathbf{x}_p\right) \right\}_{p=1}^P$
5: Standard normalize: $\,\,\,\,\,\,\,\,\, f^{(1)} \left(\mathbf{x} \right) \longleftarrow \frac{f^{(1)} \left(\mathbf{x}\right) - \mu_{f^{(1)}}}{\sigma_{f^{(1)}}}$
6: output: Batch normalized single layer unit $\,\, f^{(1)} \left(\mathbf{x} \right)$
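The batch normalization recipe above can be sketched directly in NumPy. The function name and dimensions are illustrative, tanh is taken as the activation, and a small `eps` guards against division by zero:

```python
import numpy as np

def batch_normalized_unit(X, w, a=np.tanh, eps=1e-8):
    # X is P x N (one input point per row); w holds [w_0, w_1, ..., w_N]
    v = w[0] + np.dot(X, w[1:])       # linear combination for every point
    f = a(v)                          # activation output over the whole batch
    mu, sd = f.mean(), f.std()        # batch mean and standard deviation
    return (f - mu) / (sd + eps)      # standard normalize the unit's output

np.random.seed(2)
X = np.random.randn(50, 3)            # P = 50 points, N = 3 input dimensions
w = np.random.randn(4)
f = batch_normalized_unit(X, w)       # mean ~ 0 and std ~ 1 over the batch
```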
Here we use two single-layer ReLU units $f_1^{(1)}$ and $f_2^{(1)}$, applied to performing two-class classification of a toy dataset.
viewer.plot_data()
show_video(video_path_1)
We repeat the same experiment but now with batch-normalized single-layer units.
show_video(video_path_2)
via the substitution:
\begin{equation*} f^{(L)}_j \left(\mathbf{x} \right) \longleftarrow \frac{f^{(L)}_j \left(\mathbf{x}\right) - \mu_{f^{(L)}_j}}{\sigma_{f^{(L)}_j}} \end{equation*}
where
\begin{equation*} \mu_{f^{(L)}_j} = \frac{1}{P}\sum_{p=1}^{P}f^{(L)}_j\left(\mathbf{x}_p \right) \qquad \text{and} \qquad \sigma_{f^{(L)}_j} = \sqrt{\frac{1}{P}\sum_{p=1}^{P}\left(f^{(L)}_j\left(\mathbf{x}_p \right) - \mu_{f^{(L)}_j} \right)^2}. \end{equation*}
1: input: Activation function $a\left(\cdot\right)$, number of $\left(L-1\right)$ layer units $U_{L-1}$
2: Construct $\left(L-1\right)$ layer batch normalized units: $\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,f^{(L-1)}_i\left(\mathbf{x}\right)$ for $i=1,\,...,U_{L-1}$
3: Compute linear combination: $\,\,\,\,\,\,\,\,\, v = w_{0}^{(L)}+{\sum_{i=1}^{U_{L-1}}}{w_{i}^{(L)}\,f^{(L-1)}_i}\left(\mathbf{x}\right)$
4: Pass result through activation: $\,\,\,\,\,\,\,\,\, f^{(L)}\left(\mathbf{x}\right) = a\left(v\right)$
5: Compute mean $\mu_{f^{(L)}}$ and standard deviation $\sigma_{f^{(L)}}$ of $\left\{f^{(L)}\left(\mathbf{x}_p\right) \right\}_{p=1}^P$
6: Standard normalize: $\,\,\,\,\,\,\,\,\, f^{(L)} \left(\mathbf{x} \right) \longleftarrow \frac{f^{(L)} \left(\mathbf{x}\right) - \mu_{f^{(L)}}}{\sigma_{f^{(L)}}}$
7: output: Batch normalized $L$ layer unit $\,\, f^{(L)} \left(\mathbf{x} \right)$
where the inclusion of the tunable parameters $\alpha$ and $\beta$ (which are tuned along with the other parameters of a batch-normalized network) allows for greater flexibility.
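A minimal sketch of this final form, assuming $\alpha$ acts as the learned shift and $\beta$ as the learned scale (the exact convention may differ); the function name is illustrative:

```python
import numpy as np

def normalized_unit_with_shift_scale(f, alpha, beta, eps=1e-8):
    # standard normalize the batch of unit outputs f, then let the tunable
    # parameters re-center (alpha) and re-scale (beta) the result
    f_norm = (f - f.mean()) / (f.std() + eps)
    return alpha + beta * f_norm

np.random.seed(3)
f = 4.0 * np.random.randn(100) + 2.0   # a batch of raw unit outputs
g = normalized_unit_with_shift_scale(f, alpha=0.5, beta=2.0)
```

In a batch-normalized network $\alpha$ and $\beta$ would be tuned by the optimizer along with all other weights, rather than fixed as here.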
Here we use the tanh activation, and optimize with gradient descent for 10,000 steps.
show_video(video_path_3)
We perform the same experiment, using the same activation and dataset, but using the batch-normalized version of the network.
show_video(video_path_4)
Here we use the ReLU activation function.
mylib2.show_histories()